Medical Data Integration Pipeline

ETL Demonstration: Messy CSV → Clean HL7 FHIR R4 Resources

Using Official German MII (Medizininformatik-Initiative) Test Data

  • Data Recovery: 87.7%
  • FHIR Resources: 896
  • Success Rate: 100%
  • Errors: 0

About This Project

Purpose: This demonstration project shows how messy hospital data can be transformed into FHIR-compliant resources with comprehensive quality assurance.

Approach: Uses official German MII (Medizininformatik-Initiative) test data to demonstrate understanding of the German medical informatics ecosystem and interoperability standards.

Project Overview

This project demonstrates a complete ETL pipeline that transforms messy hospital data into HL7 FHIR R4 compliant resources, built on official test data from the German Medizininformatik-Initiative (MII).

"Reverse Engineering" Approach for Validation

Started with clean MII FHIR → Created messy CSV with realistic quality issues → Transformed back to clean FHIR → Validated against original gold standard

  • Phase 1 (Extract & "Break"): FHIR → messy CSV
  • Phase 2 (Transform & Clean): CSV → clean FHIR
  • Phase 3 (Validate): compare to gold standard
  • Result (Quality Dashboard): metrics & visualizations
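Phase 3 boils down to a field-by-field comparison against the gold standard. A minimal sketch of that idea (the function name and sample records are illustrative, not the project's actual code):

```python
def field_recovery(original: dict, reconstructed: dict, fields: list[str]) -> float:
    """Fraction of gold-standard field values that survive the round trip."""
    present = [f for f in fields if original.get(f) is not None]
    if not present:
        return 1.0  # nothing to recover
    matched = sum(1 for f in present if reconstructed.get(f) == original[f])
    return matched / len(present)

# Gold-standard Patient vs. the version rebuilt from the messy CSV
gold = {"id": "00001", "gender": "male", "birthDate": "1950-01-01"}
rebuilt = {"id": "00001", "gender": "male"}  # birthDate was nulled in the CSV
recovery = field_recovery(gold, rebuilt, ["id", "gender", "birthDate"])
```

Aggregating such per-field scores over all patients yields a recovery rate like the 87.7% reported here.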

Data Quality Issues Handled

  • ID Formats: P-001, PAT001, 00001 → PAT-00001
  • Date Formats: DD.MM.YYYY, YYYY-MM-DD → ISO 8601
  • Gender Values: M, male, männlich → standardized
  • Missing Data: 10–15% nulls → handled gracefully
  • German Characters: ä, ö, ü, ß → UTF-8 encoding
  • Reference Integrity: invalid IDs → logged & flagged
Tech Stack: Python • pandas • fhir.resources (FHIR R4) • pytest • JavaScript/CSS/HTML
Data Source: Official MII Kerndatensatz Test Data (1,650 FHIR bundles)

Transformation Results

Success Rates by Resource Type

Resource Counts

  • Patients: 200 / 200
  • Conditions: 400 / 400
  • Medications: 115 unique
  • MedicationAdministrations: 181 / 181
  • Total FHIR Resources: 896

Transformation Viewer

See exactly how the pipeline transforms messy CSV input into clean FHIR R4 resources. Each example uses real data from the pipeline.

Messy CSV Input
  • PatientID: WP1-00001
  • FirstName: (empty)
  • LastName: Mustermann
  • Birthdate: 01.01.1950
  • Gender: Male
  • Street: (empty)
  • City: Bonn
  • PostalCode: (empty)
Clean FHIR R4 Output
  • id: 00001
  • given: (omitted)
  • family: Mustermann
  • birthDate: 1950-01-01
  • gender: male
  • line: (omitted)
  • city: Bonn
  • postalCode: (omitted)
Transformations applied:
  • ID normalized: WP1-00001 → 00001 (stripped prefix)
  • Date converted: 01.01.1950 (DD.MM.YYYY) → 1950-01-01 (ISO 8601)
  • Gender standardized: Male → male (FHIR ValueSet)
  • Missing fields (FirstName, Street, PostalCode) gracefully omitted from FHIR output
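The "graceful omission" shown above can be sketched with plain dicts. The real pipeline uses fhir.resources / pydantic models; `build_patient` and the column names are illustrative, and input values are assumed already normalized:

```python
def build_patient(row: dict) -> dict:
    """Assemble a FHIR R4 Patient as JSON, omitting empty fields rather than emitting nulls."""
    patient = {"resourceType": "Patient", "id": row["PatientID"]}
    for csv_col, fhir_field in (("Gender", "gender"), ("Birthdate", "birthDate")):
        if row.get(csv_col):
            patient[fhir_field] = row[csv_col]
    name = {}
    if row.get("LastName"):
        name["family"] = row["LastName"]
    if row.get("FirstName"):
        name["given"] = [row["FirstName"]]  # FHIR allows multiple given names
    if name:
        patient["name"] = [name]
    address = {}
    if row.get("Street"):
        address["line"] = [row["Street"]]
    for csv_col, fhir_field in (("City", "city"), ("PostalCode", "postalCode")):
        if row.get(csv_col):
            address[fhir_field] = row[csv_col]
    if address:
        patient["address"] = [address]
    return patient
```

For the patient above, this yields a name without `given` and an address containing only `city`, matching the output shown.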
Messy CSV Input
  • PatientID: 00004
  • FirstName: (empty)
  • LastName: Mustermann
  • Birthdate: 1953/01/01
  • Gender: W
  • Street: Musterstraße 1
  • City: Bonn
  • PostalCode: 53121
Clean FHIR R4 Output
  • id: 00004
  • given: (omitted)
  • family: Mustermann
  • birthDate: 1953-01-01
  • gender: female
  • line: Musterstraße 1
  • city: Bonn
  • postalCode: 53121
Transformations applied:
  • Date converted: 1953/01/01 (YYYY/MM/DD) → 1953-01-01 (ISO 8601)
  • Gender mapped: W (German abbreviation for weiblich) → female (FHIR ValueSet)
  • German character ß preserved correctly in UTF-8 address
Messy CSV Input
  • PatientID: 00006
  • FirstName: Maxi_03
  • LastName: (empty)
  • Birthdate: (empty)
  • Gender: weiblich
  • Street: (empty)
  • City: Bonn
  • PostalCode: 53121
Clean FHIR R4 Output
  • id: 00006
  • given: Maxi_03
  • family: (omitted)
  • birthDate: (omitted, no data)
  • gender: female
  • line: (omitted)
  • city: Bonn
  • postalCode: 53121
Transformations applied:
  • Gender mapped: weiblich (German for female) → female (FHIR ValueSet)
  • Missing LastName, Birthdate, Street gracefully omitted from FHIR resource
  • This patient demonstrates maximum missing data handling (4 of 8 fields empty)
Messy CSV Input
  • ConditionID: WP1-1
  • PatientID: PAT00002
  • Code: M80.01
  • CodeSystem: (empty)
  • Display: Osteroporose
  • RecordedDate: 2019/01/05
Clean FHIR R4 Output
  • resourceType: Condition
  • code.system: icd-10-gm
  • code.code: M80.01
  • code.display: Osteroporose
  • subject: Patient/00002
  • clinicalStatus: active
Transformations applied:
  • Missing CodeSystem inferred from ICD-10 code format → http://fhir.de/CodeSystem/bfarm/icd-10-gm
  • Patient reference normalized: PAT00002 → Patient/00002
  • Clinical status defaulted to active (not present in CSV)
  • Date converted: 2019/01/05 → ISO 8601 format
Messy CSV Input
  • MedicationID: PAT1
  • PatientID: WP1-00001
  • MedicationCode: N06AA09
  • MedicationName: ATC_AMITRYPTILIN
  • DoseValue: 1
  • DoseUnit: (empty)
  • Status: (empty)
  • EffectiveDate: 01-01-2019
Clean FHIR R4 Output
  • resourceType: MedicationAdministration
  • status: completed
  • medication: Medication/Medication-N06AA09
  • subject: Patient/00001
  • effectiveDateTime: 2019-01-01
Transformations applied:
  • Patient reference normalized: WP1-00001 → Patient/00001
  • ATC code N06AA09 linked to a Medication resource reference
  • Missing Status defaulted to completed
  • Date converted: 01-01-2019 (DD-MM-YYYY) → 2019-01-01 (ISO 8601)
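The defaulting and reference linking for this resource type can be sketched as follows; `build_medication_administration` and the column names are illustrative, and the date is assumed already converted to ISO 8601:

```python
import re

def build_medication_administration(row: dict) -> dict:
    """Assemble a minimal FHIR R4 MedicationAdministration with sensible defaults."""
    pid = re.search(r"(\d+)$", row["PatientID"]).group(1).zfill(5)
    return {
        "resourceType": "MedicationAdministration",
        # Empty Status defaults to "completed" (these are historical administrations)
        "status": row.get("Status") or "completed",
        "medicationReference": {"reference": f"Medication/Medication-{row['MedicationCode']}"},
        "subject": {"reference": f"Patient/{pid}"},
        "effectiveDateTime": row["EffectiveDate"],
    }
```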

Data Quality Metrics

Validation Against Original FHIR

Quality Scores

Gender
  • Correctly matched: 178 patients
  • Missing in source CSV (~11% randomly nulled): 22 patients
  • Input variations handled: M, m, Male, male, W, Female, weiblich, männlich
The gap is entirely due to deliberately introduced missing values (~11% null rate). All gender values that were present were correctly mapped to the FHIR ValueSet (male/female).
Names
  • Correctly matched: 173 patients
  • Missing FirstName in CSV: 23 patients (11.5%)
  • Missing LastName in CSV: 27 patients (13.5%)
Name mismatches are caused by deliberately nulled fields: when either the first or the last name is missing from the messy CSV, the reconstructed FHIR name cannot match the original. This is expected and by design.
Birthdate
  • Correctly matched: 135 patients
  • Missing in source CSV (~13% randomly nulled): 26 patients
  • Unparseable date formats: 26 patients
  • Ambiguous dates (DD-MM vs MM-DD): ~13 patients
  • Formats successfully parsed: DD.MM.YYYY, YYYY/MM/DD, YYYY-MM-DD
This is the lowest score, and intentionally so: dates are the hardest data quality challenge. Ambiguous formats like 01-05-2019 (January 5 or May 1?) cannot be resolved without context. This is a realistic real-world ETL problem.
Condition Codes
  • Valid ICD-10-GM codes: 366 conditions
  • Missing codes in source CSV (~8.5% nulled): 34 conditions
  • Missing CodeSystem (inferred from code): 47 conditions
When the CodeSystem column was empty but a valid ICD-10 code was present (e.g. M80.01), the pipeline inferred http://fhir.de/CodeSystem/bfarm/icd-10-gm from the code format.
Overall
  • Overall Data Recovery: 87.7% (EXCELLENT)
  • Quality Score: 100/100 (PASSED)
  • Transformation Errors: 0

Data Quality Issues Handled

Issue Type | Example Input (Messy) | Output (Clean FHIR) | Status
ID Format Variations | P-00001, PAT00002, 00003, WP1-00004 | 00001, 00002, 00003, 00004 | Normalized
Date Format Variations | 01.01.1950, 1950/01/01, 01-01-1950 | 1950-01-01 (ISO 8601) | Normalized
Gender Inconsistencies | M, m, male, männlich, W, weiblich | male, female (FHIR ValueSet) | Normalized
Missing Data | ~12% random missing values | Handled with defaults or omitted | Partial
German Characters | Müller, Schröder, Weiß | UTF-8 preserved | Preserved

Requirements Alignment

How this project and my broader experience map to core data engineering requirements.

Requirement
Demonstrated in This Project
Additional Experience
ETL/ELT Pipelines
Spark & NiFi
Project Complete ETL pipeline: FHIR → messy CSV → clean FHIR with validation. Production architecture section shows Spark + NiFi design.
Experience Large-scale ETL/ELT pipelines for ingesting, cleaning, enriching and indexing 413K+ texts into NoSQL databases and Elasticsearch.
SQL & Databases
Database know-how
Project HAPI FHIR Server (PostgreSQL) proposed in production architecture; data modeling for FHIR resources.
Experience Flask web application with SQL and NoSQL databases. Elasticsearch query design and index optimization for large-scale collections.
Medical Data Standards
FHIR, LOINC, OMOP
Project HL7 FHIR R4 resources, ICD-10-GM codes, ATC classification, MII Kerndatensatz profiles. OMOP CDM in production architecture.
Experience Analogous standards work: Dublin Core metadata, cross-source harmonisation, schema mapping, and semantic normalisation across heterogeneous datasets.
Programming
Python, SQL, R, Java
Project Python (pandas, fhir.resources, pydantic), pytest, JavaScript/HTML/CSS. ~2,500 lines of tested code.
Experience 2+ years production Python (Flask, NLP, ML). SQL for relational database work.
Distributed & Cloud
Kubernetes, Spark, Ansible
Project Production architecture proposes Spark cluster with horizontal scaling and fault tolerance.
Experience Deployment to Flemish Supercomputer Center (VSC) Tier-1 cloud: environment configuration, deployment coordination, operational constraints.
Quality Assurance
Quality assurance
Project Validation against gold standard, data quality drill-downs, 87.7% recovery rate with full transparency on gaps and causes.
Experience End-to-end data quality throughout research pipelines. Published digital humanities work requiring rigorous, auditable data provenance.
Documentation
Living documentation
Project This interactive dashboard is the living documentation: transformation viewer, quality drill-downs, architecture rationale.
Experience Extensive documentation practice across research publications. Structured, goal-oriented communication with both technical and non-technical stakeholders.
Healthcare Domain
Medical context
Project German medical informatics context: MII test data, German terminology, ICD-10-GM. Built from understanding of real DIC workflows.
Experience PhD in History of Medicine. Healthcare consulting (social determinants of health modeling). Python desktop application for a clinical researcher.

Project = demonstrated in this repository    Experience = from professional background

From Demo to Production Architecture

This demo uses a single-threaded Python pipeline. Below is how the same logic would be architected for production-scale medical data integration at a university hospital.

📝 Manual Execution: python3 src/pipeline.py
🐍 Python Script: sequential processing, 1 core, ~2 GB RAM
📄 CSV → FHIR: 200 patients, pandas + fhir.resources
💾 Local Filesystem: JSON files on disk

Suitable for demonstration, prototyping, and validation of transformation logic.

Proposed Architecture

🔄 Apache NiFi: orchestration & ingestion, data provenance tracking
Apache Spark: distributed transformation, 4–8 nodes, 200K+ patients
🏥 HAPI FHIR Server: production data store, REST API + search
📊 Analytics Layer: Parquet / data lake, research queries


Aspect | Demo Pipeline | Production Pipeline
Orchestration | Manual CLI execution | NiFi flow-based scheduling & monitoring
Processing | Single-threaded Python / pandas | Spark distributed across cluster nodes
Scale | 200 patients (~30 s) | 200K+ patients, horizontal scaling
Storage | Local JSON files | HAPI FHIR Server + Parquet data lake
Fault Tolerance | Script re-run on failure | Automatic retry, checkpointing, dead-letter queues
Monitoring | Console output + log files | NiFi dashboard, Spark UI, alerting
Research Export | JSON files for validation | OMOP CDM + FHIR Bulk Export for multi-site studies

Based on industry best practices for FHIR data integration at German university hospitals and the MII consortium architecture.

Technical Implementation

Technologies Used

  • Programming Language: Python 3.11
  • FHIR Library: fhir.resources 7.1.0
  • Data Processing: pandas, pydantic
  • Standards: FHIR R4, ICD-10-GM, ATC
  • Lines of Code: ~2,500

Key Features

  • Data quality validation (completeness, consistency)
  • Referential integrity checks
  • German medical terminology (ICD-10-GM, ATC)
  • MII Kerndatensatz profile compliance
  • Comprehensive error handling
  • Validation against source data
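The referential-integrity check can be sketched as follows; `flag_broken_references` and the data shapes are illustrative. Conditions whose subject does not resolve to a known Patient are flagged and logged rather than silently dropped:

```python
def flag_broken_references(patient_ids: set[str], conditions: list[dict]) -> list[dict]:
    """Return conditions whose subject reference does not resolve to a known Patient."""
    flagged = []
    for cond in conditions:
        ref = cond.get("subject", {}).get("reference", "")
        pid = ref.removeprefix("Patient/")
        if pid not in patient_ids:
            flagged.append(cond)  # logged & surfaced in the quality report
    return flagged
```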

How to Run This Project

Quick Start

# 1. Clone the repository
git clone https://codeberg.org/YOUR-USERNAME/YOUR-REPO
cd YOUR-REPO

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the extraction (FHIR → CSV)
python3 src/fhir_to_csv_extractor.py

# 4. Run the transformation pipeline (CSV → FHIR)
PYTHONPATH=. python3 src/pipeline.py

# 5. Validate against original FHIR
PYTHONPATH=. python3 src/validate_against_original.py

# 6. View this dashboard
open dashboard/index.html

Requirements: Python 3.9+, ~2 GB disk space for test data
Execution time: ~2–3 minutes for the full pipeline (200 patients)